Abstract - Reinforcement learning means learning a policy
(a mapping of observations into actions) based on feedback
from the environment. The learning can be viewed as browsing
a set of policies while evaluating them by trial through
interaction with the environment. We present an application
of a gradient ascent algorithm for reinforcement learning to
the complex domain of packet routing in network
communication and compare the performance of this algorithm
to other routing methods on a benchmark problem.

I. Introduction 

Successful telecommunication requires efficient resource
allocation, which can be achieved by developing adaptive
control policies. Reinforcement learning (RL) [?], [?]
presents a natural framework for developing such policies
by trial and error in the process of interaction with the
environment. In this work we apply an RL algorithm to
network routing. Effective network routing means selecting
the optimal communication paths. It can be modeled as a
multi-agent RL problem. In a sense, learning the optimal
control for network routing can be thought of as learning
in a traditional episodic RL task, such as maze searching or
pole balancing, but with many trials repeated in parallel
and with interaction among trials.

Under this interpretation, an individual router is an agent
that makes its routing decisions according to an individual
policy. The parameters of this policy are adjusted according
to some measure of the global performance of the network,
while control is determined by local observations. Nodes do
not have any information regarding the topology of the
network or their position in it. The initialization of each
node, as well as the learning algorithm it follows, is
identical to that of every other node and independent of the
structure of the network. There is no notion of orientation
in space or other semantics of actions. Our approach allows
us to update the local policies while avoiding the need for
centralized control or global knowledge of the network's
structure. The only global information required by the
learning algorithm is the network utility, expressed as a
reward signal that is distributed once per epoch and depends
on the average routing time. This learning multi-agent
system is biologically plausible and can be thought of as a
neural network in which each neuron performs only simple
computations based on locally available quantities [?].

II. Domain 

We test our algorithm on a domain adopted from Boyan and
Littman [4]. It is a discrete-time simulator of
communication networks with various topologies and dynamic
structure. A communication network is an abstract
representation of real-life systems such as the Internet or
a transport network. It consists of a homogeneous set of
nodes and edges between them representing links (see
Figure 1). Nodes linked to each other are called neighbors.
Links may be active ("up") or inactive ("down"). Each node
can be the origin or the final destination of packets, or
serve as a router.

Packets are periodically introduced into the network with a
uniformly random node of origin and destination. They travel
to their destination node by hopping between intermediate
nodes. No packet is generated with its node of origin as its
destination. Sending a packet down a link incurs a cost that
can be thought of as time in transit. There is an added
cost, a queue delay, for waiting in the queue of a particular
node in order to access the router's computational resource.
Both costs are assumed to be uniform throughout the network.
In our experiments, each is set to a unit cost. The level of
network traffic is determined by the number of packets in the
network. Once a packet reaches its destination, it is
removed. If a packet has been traveling around the network
for a long time, it is also removed as a hopeless case.
Multiple packets line up at a node in a FIFO (first in,
first out) queue of limited size. The node must forward the
packet at the head of the FIFO queue to one of its neighbors.
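The queueing behaviour described above can be sketched as follows. This is a minimal illustration rather than the simulator's code: the class name `Node`, the `capacity` parameter and the choice to reject a packet that arrives at a full queue are assumptions, since the text only states that queues are FIFO and limited in size.

```python
from collections import deque


class Node:
    """Hypothetical router node with a bounded FIFO queue (illustrative only)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity  # assumed maximum queue length
        self.queue = deque()

    def enqueue(self, packet):
        """Append a packet; return False if the queue is full (assumed policy)."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(packet)
        return True

    def pop_head(self):
        """Remove and return the packet at the head of the queue, or None if empty."""
        return self.queue.popleft() if self.queue else None
```

Each simulation step would call `pop_head()` on a node and hand the returned packet to that node's routing policy.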

In the terminology of RL, the network represents the
environment, whose state is determined by the number and
relative position of nodes, the status of the links between
them and the dynamics of packets. The destination of the
packet being handled and the status of local links form the
node's observation. Each node is an agent that has a choice
of actions. It decides where to send the packet according to
a policy. The policy computed by our algorithm is stochastic
rather than deterministic, i.e. it sends packets bound for
the same destination down different links according to some
distribution. The policy considered in our experiments does
not determine whether or not to accept a packet (admission
control), how many packets to accept from each neighbor, or
which packets should be assigned priority.

The node updates the parameters of its policy based on the
reward. The reward comes in the form of a signal distributed
through the network by acknowledgment packets once a packet
has reached its final destination. The reward depends on the
total delivery time for the packet. We measure the
performance of the algorithm by the average delivery time
for packets once the system has settled on a policy (the
ordinate axes in Figure 2). We apply policy shaping by
explicitly penalizing loops in the route. Each packet is
assumed to carry some elements of its routing history in
addition to the obvious destination and origin information.
These include the time when the packet was generated, the
time the packet last received attention from some router,
the trace of recently visited nodes and the number of hops
performed so far. If a packet is detected to have spent too
much time in the network without reaching its destination,
the packet is discarded and the network is penalized
accordingly. Thus, a defining factor in our simulation is
whether the number of hops performed by a packet exceeds the
total number of nodes in the network.
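The routing history carried by a packet, and the two checks built on it, might look like the sketch below. The field names, the helper names `record_hop` and `is_hopeless`, and the trace length of five nodes are all assumptions made for illustration; the text specifies only which quantities are carried and that a packet is discarded once its hop count exceeds the number of nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    """Hypothetical packet record holding the history items listed in the text."""
    origin: int
    destination: int
    created_at: int
    last_served_at: int = 0
    hops: int = 0
    trace: list = field(default_factory=list)  # recently visited nodes


def record_hop(packet, node, now, trace_len=5):
    """Update the history when `node` forwards the packet.

    Returns True if `node` already appears in the recent trace, i.e. the
    packet has looped; the caller can then apply a loop penalty.
    """
    looped = node in packet.trace
    packet.trace.append(node)
    del packet.trace[:-trace_len]  # keep only the last `trace_len` nodes
    packet.hops += 1
    packet.last_served_at = now
    return looped


def is_hopeless(packet, num_nodes):
    """Discard rule from the text: more hops than there are nodes."""
    return packet.hops > num_nodes
```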

III. Algorithmic details 

Williams introduced the notion of policy search via gradient
ascent for reinforcement learning in his REINFORCE
algorithm [18], [19], which was generalized to a broader
class of error criteria by Baird and Moore [1], [2]. The
general idea is to adjust parameters in the direction of the
empirically estimated gradient of the aggregate reward. We
assume the standard Markov decision process (MDP) setup [10].
Let us consider the case of a single agent interacting with a
partially observable MDP (POMDP). The agent's policy $\mu$
is a so-called reactive policy represented by a lookup table
with a value $\theta_{oa}$ for each observation-action
(destination/link) pair. The policy defines the probability
of an action given past history as a continuous
differentiable function of a set of parameters $\theta$
according to a softmax rule, where $S$ is a temperature
parameter:

$$\mu(a, o, \theta) = \Pr(a \mid o, \theta) = \frac{\exp(\theta_{oa}/S)}{\sum_{a'} \exp(\theta_{oa'}/S)}$$

This rule assures that for any destination $o$, any link $a$
available at the node is sometimes chosen, with some small
probability dependent on the temperature $S$.
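As a minimal sketch of the softmax rule, the probability of choosing link a for destination o is taken to be proportional to exp(theta_oa / S). The function names `softmax_probs` and `choose_link` and the dictionary layout of one lookup-table row are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def softmax_probs(theta_row, temperature):
    """Map one lookup-table row {link: theta} to link probabilities.

    Subtracting the maximum before exponentiating leaves the result
    unchanged and avoids overflow for large parameter values.
    """
    m = max(theta_row.values())
    weights = {a: math.exp((v - m) / temperature) for a, v in theta_row.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def choose_link(theta_row, temperature, rng=random):
    """Sample an outgoing link according to the softmax probabilities."""
    probs = softmax_probs(theta_row, temperature)
    links = list(probs)
    return rng.choices(links, weights=[probs[a] for a in links])[0]
```

Every link receives a strictly positive probability, and a higher temperature spreads the probability more evenly across links.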

We denote by $H_t$ the set of all possible experience
sequences
$h = (o(1), a(1), r(1), \ldots, o(t), a(t), r(t), o(t+1))$
of length $t$. In order to specify that some element is a
part of the history $h$ at time $\tau$, we write, for
example, $r(\tau, h)$ and $a(\tau, h)$ for the $\tau$-th
reward and action in the history $h$. We will also use
$h^\tau$ to denote a prefix of the sequence $h \in H_t$
truncated at time $\tau < t$:
$h^\tau = (o(1), a(1), r(1), \ldots, o(\tau), a(\tau), r(\tau), o(\tau+1))$.
The value of following a policy $\mu$ with parameters
$\theta$ is the expected cumulative discounted (by a factor
of $\gamma \in [0, 1)$) reward that can be written as

$$V(\theta) = \sum_{t=1}^{\infty} \gamma^{t} \sum_{h \in H_t} \Pr(h \mid \theta)\, r(t, h)$$
If we could calculate the derivative of $V(\theta)$ for each
$\theta_{oa}$, it would be possible to do an exact gradient
ascent on the value $V(\theta)$ by making updates
$\Delta\theta_{oa} = \alpha \frac{\partial}{\partial \theta_{oa}} V(\theta)$
for some step size $\alpha$. Let us analyze the derivative
for each weight $\theta_{oa}$.
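A sketch of this derivative follows. It assumes the value is the discounted sum over histories of $\Pr(h \mid \theta)\, r(t,h)$, writes $\mu(a, o, \theta)$ for the probability that the policy selects action $a$ on observation $o$, and uses the identity $\partial \Pr = \Pr \cdot \partial \ln \Pr$ together with the fact that only the policy factors of $\Pr(h \mid \theta)$ depend on $\theta$:

```latex
\frac{\partial V(\theta)}{\partial \theta_{oa}}
  = \sum_{t=1}^{\infty} \gamma^{t} \sum_{h \in H_t} r(t,h)\,
    \frac{\partial \Pr(h \mid \theta)}{\partial \theta_{oa}}
  = \sum_{t=1}^{\infty} \gamma^{t} \sum_{h \in H_t} r(t,h)\, \Pr(h \mid \theta)
    \sum_{\tau=1}^{t}
    \frac{\partial \ln \mu\bigl(a(\tau,h),\, o(\tau,h),\, \theta\bigr)}
         {\partial \theta_{oa}}
```

The environment's transition and observation probabilities cancel out of the logarithmic derivative, which is why the remaining inner sum involves the policy alone.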
However, in the spirit of reinforcement learning, we assume
no knowledge of a world model that would allow the agent to
calculate $\Pr(h \mid \theta)$, so we must retreat to
stochastic gradient ascent instead. We sample from the
distribution of histories by interacting with the
environment, and calculate during each trial an estimate of
the gradient, accumulating the quantities
$\gamma^{t} r(t,h) \sum_{\tau=1}^{t} \frac{\partial \ln \mu(a(\tau,h), o(\tau,h), \theta)}{\partial \theta_{oa}}$
for all $t$, where $\mu(a, o, \theta)$ is the probability
that the policy selects action $a$ given observation $o$.
For a particular policy architecture, this can be readily
translated into a gradient ascent algorithm guaranteed to
converge to a local optimum $\theta^*$ of $V(\theta)$. Under
our chosen policy encoding we get:
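For a softmax lookup table, the logarithmic derivative of the policy with respect to theta_oa is (1 - mu(a, o, theta)) / S when link a is chosen for destination o, -mu(a, o, theta) / S when another link is chosen for o, and zero for other destinations. Summed over a history, this gives (N_oa - N_o * mu(a, o, theta)) / S, where N_o counts how often o was observed and N_oa how often a was chosen for it. The sketch below implements that update under stated assumptions: the class `GapsRouter`, its method names and its default parameter values are hypothetical, the probabilities are evaluated at the current parameters, and a reward is applied whenever `reward` is called rather than on the simulator's actual schedule.

```python
import math
from collections import defaultdict


class GapsRouter:
    """Illustrative single-node learner with a softmax lookup-table policy."""

    def __init__(self, links, destinations, alpha=0.01, temperature=1.0, gamma=0.99):
        self.links = list(links)
        self.alpha = alpha              # step size
        self.temperature = temperature  # softmax temperature S
        self.gamma = gamma              # discount factor
        self.theta = {o: {a: 0.0 for a in self.links} for o in destinations}
        self.n_o = defaultdict(int)     # times destination o was observed
        self.n_oa = defaultdict(int)    # times link a was chosen for destination o
        self.t = 0                      # number of rewards received so far

    def probs(self, o):
        """Softmax probabilities over links for destination o."""
        row = self.theta[o]
        m = max(row.values())
        w = {a: math.exp((v - m) / self.temperature) for a, v in row.items()}
        z = sum(w.values())
        return {a: x / z for a, x in w.items()}

    def record(self, o, a):
        """Count one routing decision: link a was used for destination o."""
        self.n_o[o] += 1
        self.n_oa[(o, a)] += 1

    def reward(self, r):
        """Apply one stochastic gradient step for a received reward r."""
        self.t += 1
        scale = self.alpha * self.gamma ** self.t * r / self.temperature
        for o, row in self.theta.items():
            if self.n_o[o] == 0:
                continue  # destination never observed: zero gradient
            p = self.probs(o)
            for a in self.links:
                row[a] += scale * (self.n_oa[(o, a)] - self.n_o[o] * p[a])
```

A positive reward raises the parameters of links that were used more often than the current policy would predict, and lowers the others by the same total amount.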
Applying this algorithm in a network of connected
controllers constitutes the algorithm of routing by
distributed gradient ascent policy search (GAPS).

We compare the performance of our distributed GAPS algorithm
to three others, as follows. "Best" is a static routing
scheme based on the shortest path, counting each link as a
single unit of routing cost. We include this algorithm
because it provides the basis for most current industry
routing heuristics [3], [?]. "Bestload" performs routing
according to the shortest path while taking into account
queue sizes at each node. It is close to the theoretical
optimum among deterministic routing algorithms, even though
the actual best possible routing scheme requires not simply
computing the shortest path based on network loads, but also
analyzing how loads change in time according to routing
decisions. Since calculating the shortest path at every
single step of the simulation would be prohibitively costly
in terms of computational resources, we implemented
"Bestload" by readjusting the routing policy only after a
notable change of loads in the network. We consider 50
successfully delivered packets to constitute a notable load
change. Finally, "Q-routing" is a distributed RL algorithm
applied specifically to this domain by Littman and
Boyan [4]. While our algorithm is stochastic and performs
policy search, Q-routing is a deterministic, value-search
algorithm. Note that our implementation of the network
routing simulation is based on the software Littman and
Boyan used to test Q-routing. Even so, the results of our
simulation of "Q-routing" and "Best" on the "6x6" network
differ slightly from Littman and Boyan's due to certain
modifications in traffic modeling conventions. For instance,
we consider a packet delivered and ready for removal only
after it has passed through the queue of the destination
node and accessed its computational resources, and not
merely when the packet is successfully routed to the
destination node by an immediate neighbor, as in the
original simulation.
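The two shortest-path baselines can be sketched with a single routine. This is an illustration under assumptions, not the code used in the experiments: the function name `next_hops`, the adjacency-dictionary graph format and the cost model (one unit per link plus the current queue length at the receiving node) are hypothetical. Passing no queue lengths gives a "Best"-style table; passing them gives a "Bestload"-style table.

```python
import heapq


def next_hops(graph, dest, queue_len=None):
    """Return {node: neighbor to forward to} on a cheapest path to `dest`.

    graph: dict mapping each node to a list of its neighbors (undirected).
    queue_len: optional dict of current queue sizes; missing nodes count as 0.
    Assumed cost of forwarding to neighbor v: 1 (link) + queue_len[v] (waiting).
    """
    queue_len = queue_len or {}
    dist = {dest: 0}
    hop = {}
    heap = [(0, dest)]
    # Dijkstra's algorithm run backwards from the destination.
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u in graph[v]:
            nd = d + 1 + queue_len.get(v, 0)
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                hop[u] = v
                heapq.heappush(heap, (nd, u))
    return hop
```

In a "Bestload"-style scheme the table would be recomputed only after a notable load change, for example after every 50 delivered packets as in the text.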

We undertake the comparison between GAPS and the 
aforementioned algorithms with one important caveat. 
The GAPS algorithm explores the class of stochastic poli- 
cies while all other methods pick deterministic routing 
policies. Consequently, it is natural to expect GAPS to 
be superior for certain types of network topologies and 
loads, where the optimal policy is stochastic. Later, we 
show that our experiments confirm this expectation. 

We implement distributed GAPS in a POMDP setting. In
particular, we represent each router as a POMDP, where the
state contains the sizes of all queues, the destinations of
all packets and the state of links (up or down); the
environment state transition function is the law of the
dynamics of network traffic; an observation $o$ consists of
the destination of the packet; an action $a$ corresponds to
sending the packet down a link to an adjacent node; and
finally, the reward signal is the average number of packets
delivered per unit of time. Each agent uses the GAPS RL
algorithm to move its parameter values along the gradient of
the average reward. It has been shown [14] that an
application of distributed GAPS causes the system as a whole
to converge to a local optimum under stationarity
assumptions. This algorithm is essentially the one described
in chapter 3 and developed in chapter 5 of Peshkin's
dissertation [13].

Policies were initialized in two different ways: randomly
and based on shortest paths. We first tried initialization
with a random policy chosen uniformly over the parameter
space. With such initialization, results are very sensitive
to the learning rate. A high learning rate often causes the
network to get stuck in local optima of the combined policy
space, with very poor performance. A low learning rate
results in slow convergence. What constitutes a high or low
learning rate depends on the specifics of each network, and
we did not find any satisfactory heuristic for setting it.
Features such as the average number of hops necessary to
deliver a packet under the optimal policy, as well as the
learning speed, depend crucially on the particular
characteristics of each network, such as the number of nodes,
connectivity and modularity.

These considerations led us to a different way of
initializing controllers. Namely, we begin by computing the
shortest paths and set the controllers to route most of the
traffic down the shortest path, while occasionally sending a
packet to explore an alternative link. We call this
"$\epsilon$-greedy routing". In our experiments, $\epsilon$
is set to 0.01. We believe that this parameter would not
qualitatively change the outcome of our experiments, since
it only influences exploratory behaviour in the beginning.
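One way to turn a shortest-path table into initial softmax parameters is sketched below. It rests on assumptions not stated in the text: the function name `epsilon_greedy_theta` is hypothetical, and the exploration probability is split evenly over the non-shortest links. Setting each parameter to the temperature times the logarithm of the target probability makes the softmax reproduce those probabilities exactly.

```python
import math


def epsilon_greedy_theta(links, best_link, epsilon=0.01, temperature=1.0):
    """Return {link: theta} for one destination.

    softmax(theta / temperature) then assigns probability 1 - epsilon to
    `best_link` and epsilon / (number of other links) to each other link.
    """
    others = [a for a in links if a != best_link]
    if not others:
        return {best_link: 0.0}  # a single link is always chosen
    probs = {a: epsilon / len(others) for a in others}
    probs[best_link] = 1.0 - epsilon
    return {a: temperature * math.log(p) for a, p in probs.items()}
```

For a node with three links and epsilon of 0.01, the shortest-path link starts with probability 0.99 and each alternative with 0.005.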

The exploration capacity of the algorithm is regulated in a
different way as well. Temperature and learning rate are
simply kept constant, both for simplicity and to maintain
the controllers' ability to adjust to changes in the
network, such as link failures. However, our experiments
indicate that a schedule for reducing the learning rate
after a key initial period of learning would improve
performance. Alternatively, it would be interesting to
explore different learning rates for the routing parameters
on the one hand, and the encoding of topological features on
the other.

IV. Empirical results 

We compared the routing algorithms on several networks with
various numbers of nodes and degrees of connectivity and
modularity, including the 116-node "LATA" telephone network.
On all networks, the GAPS algorithm performed comparably to
or better than the other routing algorithms. To illustrate
the principal differences in the behavior of the algorithms
and the key advantages of distributed GAPS, we concentrate
on the analysis of two routing problems on networks that
differ in the location of a single link.

Figure 1 (left) presents the irregular 6x6 grid network
topology used by Boyan and Littman [4] in their experiments.
The network consists of two well-connected components with a
bottleneck of traffic falling on two bridging links. The
resulting dependence of network performance on the load is
depicted in Figure 2 (left). All graphs represent
performance after the policy has converged, averaged over
five runs. We tested the network on loads ranging from 0.5
to 3.5, to compare with the results obtained by Littman and
Boyan. The load corresponds to the value of the parameter of
the Poisson arrival process for the average number of
packets injected per time unit. On this network topology,
GAPS is slightly inferior to the other algorithms at lower
loads, but does at least as well as Bestload at higher
loads, outperforming both Q-routing and Best. The slightly
inferior performance at low loads is due to the exploratory
behaviour of GAPS: some fraction of packets is always sent
down a random link.
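The load parameter can be illustrated with a short sketch of Poisson packet injection. The function names `poisson_sample` and `inject_packets` and the representation of a packet as an (origin, destination) pair are assumptions; the sampler uses Knuth's multiplication method, which is adequate for small rates such as the 0.5 to 3.5 range used here.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def inject_packets(num_nodes, load, rng):
    """Create this time step's packets as (origin, destination) pairs.

    The number of packets is Poisson(load); origin and destination are
    uniformly random and always distinct.
    """
    packets = []
    for _ in range(poisson_sample(load, rng)):
        origin = rng.randrange(num_nodes)
        dest = rng.randrange(num_nodes - 1)
        if dest >= origin:
            dest += 1  # skip over the origin so the two nodes differ
        packets.append((origin, dest))
    return packets
```

Calling `inject_packets(36, 2.0, random.Random(0))` once per time step would inject two packets per step on average into a 36-node network.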

To illustrate the difference between the algorithms more
explicitly, we altered the network by moving just one link
from connecting nodes 32 and 33 to connecting nodes 20 and
27, as illustrated by Figure 1 (right). Since node 20
obviously represents a bottleneck in this configuration, the
optimal routing policy is bound to be stochastic. The
resulting dependence of network performance on the load is
presented in Figure 2 (right). GAPS is clearly superior to
the other algorithms at high loads. It even outperforms
"Bestload", which has all the global information in choosing
a policy but is bound to deterministic policies. Notice how
the deterministic algorithms get frustrated at much lower
loads in this network configuration than in the previous
one, since from their perspective the bridge between the
highly connected components becomes half as wide (compare
left and right of Figure 2).

The GAPS algorithm successfully adapts to changes in the
network configuration. Under increased load, the preferred
route from the left part of the network to the right becomes
evenly split between the two "bridges" at node 20. By using
link 20-27, the algorithm has to pay a penalty of making a
few extra hops compared to link 20-21, but as the size of
the queue at node 21 grows, this penalty becomes negligible
compared to the waiting time. Exploratory behavior helps
GAPS discover when links go down and adjust the policy
accordingly. We have experimented with giving each router a
few bits of memory in a finite state controller [13], [?],
but found that this does not improve performance and slows
down the learning somewhat.

V. Related Work 

The application of machine learning techniques to the domain
of telecommunications is a rapidly growing area. The bulk of
problems fit into the category of resource allocation, e.g.
bandwidth allocation, network routing, call admission
control (CAC) and power management. RL appears promising in
attacking all of these problems, separately or
simultaneously.

Marbach, Mihatsch and Tsitsiklis [11] have applied an 
actor-critic (value-search) algorithm to address resource 
allocation within communication networks by tackling 
both routing and call admission control. They adopt a 
decompositional approach, representing the network as 
consisting of link processes, each with its own differen- 
tial reward. Unfortunately, the empirical results even on 
small networks, 4 and 16 nodes, show little advantage 
over heuristic techniques. 

Carlstrom [7] introduces another RL strategy based
on decomposition called predictive gain scheduling. The 
control problem of admission control is decomposed into 
a time-series prediction of near-future call arrival rates 
and precomputation of control policies for Poisson call 
arrival processes. This approach results in faster learn- 
ing without performance loss. Online convergence rate 
increases 50 times on a simulated link with capacity 
24 units/sec. 

Generally speaking, value-search algorithms have been more
extensively investigated than policy-search ones in the
domain of communications. Value-search (Q-learning)
algorithms have arrived at promising results. Boyan and
Littman's [4] algorithm, Q-routing, proves superior to
non-adaptive techniques based on shortest paths, and robust
with respect to dynamic variations in the simulation on a
variety of network topologies, including an irregular 6x6
grid and the 116-node LATA phone network. It regulates the
trade-off between the number of nodes a packet has to
traverse and the possibility of congestion.

Wolpert, Turner and Frank [20] construct a formalism for the
so-called Collective Intelligence (COIN) neural net applied
to Internet traffic routing. The approach involves
automatically initializing and updating the local utility
functions of individual RL agents (nodes) from the global
utility and observed local dynamics. Their simulation
outperforms a Full Knowledge Shortest Path Algorithm on a
sample network of seven nodes. COIN networks employ a method
similar in spirit to the research presented here. They rely
on a distributed RL algorithm that converges on local optima
without endowing each agent node with explicit knowledge of
the network topology. However, COIN differs from our
approach in requiring the introduction of preliminary
structure into the network by dividing it into
semi-autonomous neighborhoods that share a local utility
function and encourage cooperation. In contrast, all the
nodes in our network update their policies directly from the
global reward.

The work presented in this paper focuses on packet routing
using policy search. It resembles the work of Tao, Baxter
and Weaver [12], who apply a policy-gradient algorithm to
induce cooperation among the nodes of a packet-switched
network in order to minimize the average packet delay. While
their algorithm performs well in several network types, it
takes many (tens of millions) trials to converge on a
network of just a few nodes.

Applying reinforcement learning to communication of- 
ten involves optimizing performance with respect to mul- 
tiple criteria. For a recent discussion on this challenging 
issue see Shelton [15]. In the context of wireless com- 
munication it was addressed by Brown [5] who considers 
the problem of finding a power management policy that 
simultaneously maximizes the revenue earned by provid- 
ing communication while minimizing battery usage. The 
problem is defined as a stochastic shortest path with a
discounted infinite horizon, where the discount factor
varies to model power loss. This approach resulted in a
significant (50%) improvement in power usage.

Gelenbe et al. [9] also compute the reward as a weighted
combination of the probability of packet loss and packet
delay. The packets themselves are agents controlling routing
and flow control in a Cognitive Packet Network. They split
packets into three types: "smart", "dumb" and
"acknowledgment". A small number of smart packets learn the
most efficient ways of navigating through the network, dumb
packets simply follow the route taken by the smart packets,
while acknowledgment packets travel on the inverse route of
smart packets to provide source routing information to dumb
packets. The division between smart and dumb packets is an
explicit representation of the explore/exploit dilemma.
Smart packets allow the network to adapt to structural
changes, while the dumb packets exploit the relative
stability between those changes. Promising results are
obtained both on a simulated network of 100 nodes and on a
physical network of 6 computers.

Subramanian, Druschel and Chen [16] adopt an approach from
ant colonies that is very similar in spirit. The individual
hosts in their network keep routing tables with the
associated costs of sending a packet to other hosts (such as
which routers it has to traverse and how expensive they
are). These tables are periodically updated by "ants",
messages whose function is to assess the cost of traversing
links between hosts. The ants are directed probabilistically
along available paths. They inform the hosts along the way
of the costs associated with their travel. The hosts use
this information to alter their routing tables according to
an update rule. There are two types of ants. Regular ants
use the routing tables of the hosts to alter the probability
of being directed along a certain path. After a number of
trials, all regular ants on the same mission start using the
same routes. Their function is to allow the host tables to
converge on the correct cost figure in case the network is
stable. Uniform ants take any path with equal probability.
They are the ones that continue exploring the network and
assure successful adaptation to changes in link status or
link cost.

VI. Discussion 

Admittedly, the simulation of the network routing process
presented here is far from realistic. A more realistic model
could include such factors as networks that are
non-homogeneous with regard to link bandwidth and routing
node buffer size limits, collisions of packets, packet
ordering constraints, various costs associated with, say,
particular links chosen from commercial versus government
subnetworks, and minimal Quality of Service requirements.
Introducing priorities for individual packets brings up yet
another set of optimization issues. However, the learning
algorithm we applied shows promise in handling adaptive
telecommunication protocols, and there are several obvious
ways to develop this research. Incorporating domain
knowledge into the controller structure is one such
direction. It would involve classifying nodes into
subnetworks and routing packets in a hierarchical fashion.
One step further down this line is employing learning
algorithms for routing in ad-hoc networks. Ad-hoc networks
are networks in which nodes are dynamically introduced into
and removed from the system, and existing active nodes move
about, losing some connections and establishing new ones.
Under the realistic assumption that physical variations in
the network are slower than traffic routing and evolution,
an adaptive routing protocol should outperform any
pre-defined heuristic routines. We are currently pursuing
this line of research.